knitr::opts_chunk$set(echo = TRUE)
library(tidyverse); packageVersion("tidyverse")
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.2 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## [1] '1.3.0'
library(yaml); packageVersion("yaml")
## [1] '2.2.1'
library(here); packageVersion("here")
## here() starts at /Users/sdm8/Desktop/changeseq-analysis
## [1] '0.1'
library(stringr); packageVersion("stringr")
## [1] '1.4.0'
Determine sources of variability for the CHANGE-Seq assay.
Downstream Bioinformatic Analysis:
| Letter ID | Target Site |
|---|---|
| A | CCR5_site_3 |
| B | CCR5_site_8 |
| C | TRAC_site_1 |
| D | CXCR4_site_7 |
| E | CTLA4_site_9 |
| F | AAVS1_site_14 |
Experimental Factors N: NIST S: St. Jude
| Acronym | Wet Lab | Sequencing Lab | Bioinformatics |
|---|---|---|---|
| NNN | NIST | NIST | NIST |
| NSN | NIST | St. Jude | NIST |
| NSS | NIST | St. Jude | St. Jude |
NOTE SSN: refers to the CHANGE-Seq manuscript/publication samples. See below for pre-run steps **Pending review by Cicera as samples do not seem to be matched depth-wise when compared.
NOTE NNN2: NNN same library prep sequenced on an expired MiSeq kit (details below) to test before running new prepped AEF replicate samples. When these were of adequate quality, we incorporated them into this analysis.
St Jude has not yet re-run our sequencing of the original A-F samples. (NNS)
Reference Genome Cicera: > You can download hg38 sequence from: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ and concatenate only the main chromosome sequences as hg38_chroms_only.fa
Downloaded all files with this naming scheme: “chrom1.fa.gz” + X & Y.
Concatenated files via:
cd ~/Desktop/reference_genome
cat *.fa > ~/Desktop/reference_genome/combined_hg38_chroms.fa
Trimmed NIST-sequenced files due to extra cycling with our kit via:
seqtk trimfq -e 150 \
A_S1_L001_R1_001.fastq.gz > A_S1_L001_R1_001_trimmed.fq
SSN files: Cicera: > Those samples were sequenced on a MiSeq V3 kit. For comparison of replicates, you will need to sample the gz files to the same sequence depth (between the replicates), and ideally, once you get the results, you would merge the tables and set a cutoff (read counts), because the low frequency sites are very likely to occur in one replicate but not in other. That’s what we use for the comparisons in figure 1. (edited)
you can use this command to check how many reads in each file:
parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz
Then, you can use this one to generate new files with the number of reads that you want:
seqtk sample -s100 read1.fq 1000 > sub1.fq
where 1000 is the number you want to sample to.
| Sample | File Name | File Size | Depth |
|---|---|---|---|
| A, read 1 | A_S1_L001_R1_001_trimmed.fq | 956.9 MB | 2631758 |
| A, read 2 | A_S1_L001_R2_001_trimmed.fq | 874.8 MB | 2631758 |
| B, read 1 | B_S2_L001_R1_001_trimmed.fq | 647.9 MB | 1780610 |
| B, read 2 | B_S2_L001_R2_001_trimmed.fq | 623.7 MB | 1780610 |
| C, read 1 | C_S3_L001_R1_001_trimmed.fq | 743.2 MB | 2049156 |
| C, read 2 | C_S3_L001_R2_001_trimmed.fq | 706.6 MB | 2049156 |
| D, read 1 | D_S4_L001_R1_001_trimmed.fq | 935.6 MB | 2586872 |
| D, read 2 | D_S4_L001_R2_001_trimmed.fq | 739.4 MB | 2586872 |
| E, read 1 | E_S5_L001_R1_001_trimmed.fq | 741.1 MB | 2056395 |
| E, read 2 | E_S5_L001_R2_001_trimmed.fq | 701.1 MB | 2056395 |
| F, read 1 | F_S6_L001_R1_001_trimmed.fq | 845.6 MB | 2339851 |
| F, read 2 | F_S6_L001_R2_001_trimmed.fq | 817.3 MB | 2339851 |
| Neg Ctrl, read 1 | neg-neg-control_S7_L001_R1_001_trimmed.fq | 858.1 MB | 2341975 |
| Neg Ctrl, read 2 | neg-neg-control_S7_L001_R2_001_trimmed.fq | 803 MB | 2341975 |
| Sample | File Name | File Size | Depth |
|---|---|---|---|
| A, read 1 | CRL1551_NIST_S1_R1_001.fastq | 956.9 MB | 2829840 |
| A, read 2 | CRL1551_NIST_S1_R2_001.fastq | 874.8 MB | 2829840 |
| B, read 1 | CRL1552_NIST_S2_R1_001.fastq | 647.9 MB | 2453359 |
| B, read 2 | CRL1552_NIST_S2_R2_001.fastq | 623.7 MB | 2453359 |
| C, read 1 | CRL1553_NIST_S3_R1_001.fastq | 743.2 MB | 2084289 |
| C, read 2 | CRL1553_NIST_S3_R2_001.fastq | 706.6 MB | 2084289 |
| D, read 1 | CRL1554_NIST_S4_R1_001.fastq | 935.6 MB | 2840809 |
| D, read 2 | CRL1554_NIST_S4_R2_001.fastq | 739.4 MB | 2840809 |
| E, read 1 | CRL1555_NIST_S5_R1_001.fastq | 741.1 MB | 2339453 |
| E, read 2 | CRL1555_NIST_S5_R2_001.fastq | 701.1 MB | 2339453 |
| F, read 1 | CRL1556_NIST_S6_R1_001.fastq | 845.6 MB | 2416305 |
| F, read 2 | CRL1556_NIST_S6_R2_001.fastq | 817.3 MB | 2416305 |
| Neg Ctrl, read 1 | neg-neg-CRL1557_NIST_S7_R1_001.fastq | 858.1 MB | 2604617 |
| Neg Ctrl, read 2 | neg-neg-CRL1557_NIST_S7_R2_001.fastq | 803 MB | 2604617 |
To account for extra sequencing cycles, samples sequenced at NIST were trimmed via:
seqtk trimfq -e 150
Both locations pooled samples according to St. Jude’s qPCR quantification values.
Samples run with CHANGE-Seq v1.2.7
St Jude has not yet re-run our sequencing of the original A-F samples. (NNS)
NOTE About SSN Files
I sampled the files per Cicera’s instructions and checked to make sure the number of sequences matched those of St. Jude’s A-F sequencing of our samples. However, when I started to plot them with our other samples they drastically outweighed ours and made it very difficult to see our samples on the graphs. I will include two examples here and we can discuss how we want to proceed. We may just need to compare within sequencing depths and not try to down-sample.
Total read count of all samples showing how the SSN replicates even when down-sampling drowns out our samples
Total on-target read count with the down-sampled manuscript SSN replicates showing that it makes it very difficult to see our other samples when plotted together
Loading CHANGE-Seq output txt files into a single data frame
## Using message = FALSE to avoid printing read_tsv message about column specifications
## List of changeseq output files
changeseq_lst <- list.files(path = here("data-raw"),
pattern = "identified_matched.txt",
include.dirs = TRUE, full.names = TRUE,
recursive = TRUE) %>%
set_names(basename(.))
changeseq_df <- changeseq_lst %>%
map_dfr(read_tsv, .id = "changeseq_run")
unique(changeseq_df$changeseq_run)
## [1] "NNN2_A_CCR5_site_3_GM24385_identified_matched.txt"
## [2] "NNN2_B_CCR5_site_8_GM24385_identified_matched.txt"
## [3] "NNN2_C_TRAC_site_1_GM24385_identified_matched.txt"
## [4] "NNN2_D_CXCR4_site_7_GM24385_identified_matched.txt"
## [5] "NNN2_E_CTLA4_site_9_GM24385_identified_matched.txt"
## [6] "NNN2_F_AAVS1_site_14_GM24385_identified_matched.txt"
## [7] "NNN_A_CCR5_site_3_GM24385_identified_matched.txt"
## [8] "NNN_B_CCR5_site_8_GM24385_identified_matched.txt"
## [9] "NNN_C_TRAC_site_1_GM24385_identified_matched.txt"
## [10] "NNN_D_CXCR4_site_7_GM24385_identified_matched.txt"
## [11] "NNN_E_CTLA4_site_9_GM24385_identified_matched.txt"
## [12] "NNN_F_AAVS1_site_14_GM24385_identified_matched.txt"
## [13] "NSN_A_CRL1551_CCR5_site_3_GM24385_identified_matched.txt"
## [14] "NSN_B_CRL1552_CCR5_site_8_GM24385_identified_matched.txt"
## [15] "NSN_C_CRL1553_TRAC_site_1_GM24385_identified_matched.txt"
## [16] "NSN_D_CRL1554_CXCR4_site_7_GM24385_identified_matched.txt"
## [17] "NSN_E_CRL1555_CTLA4_site_9_GM24385_identified_matched.txt"
## [18] "NSN_F_CRL1556_AAVS1_site_14_GM24385_identified_matched.txt"
## [19] "NSS_A_CRL1551_CCR5_site_3_GM24385_identified_matched.txt"
## [20] "NSS_B_CRL1552_CCR5_site_8_GM24385_identified_matched.txt"
## [21] "NSS_C_CRL1553_TRAC_site_1_GM24385_identified_matched.txt"
## [22] "NSS_D_CRL1554_CXCR4_site_7_GM24385_identified_matched.txt"
## [23] "NSS_E_CRL1555_CTLA4_site_9_GM24385_identified_matched.txt"
## [24] "NSS_F_CRL1556_AAVS1_site_14_GM24385_identified_matched.txt"
NOTE - can modify data frame to add lab name, e.g. StJude or NIST, either using file paths or separate data frame. Annotating using file names
## This code is specific to the input file name format
changeseq_anno_df <- changeseq_df %>%
## adding changeseq_run col, removing ending portion of string **NOTE THAT THIS WILL NEED CHANGED WHEN LOOKING AT OTHER DNA LINES**
mutate(changeseq_run = str_remove(changeseq_run, "_GM24385_identified_matched.txt"),
sample = "GM24385",
## target site name begins 5 characters into string of changeseq_run
target_site = str_sub(changeseq_run, 5),
## further edit target site col if contains the St Jude file prefixes of CRLXXXX
target_site = str_remove(target_site, "CRL[:digit:]*_"),
target_site = str_remove(target_site, "_sampled_"),
## lab col will contain the first 3 letter acronyms
lab = substr(changeseq_run, 1,4),
lab = str_remove(lab, "_")) %>%
## possible identifiers: NNN, NSN, NSS
## first = wet lab, second = sequence, third = bioinformatics
## take the 3 letters in lab col and split into 3 additional cols
separate(lab, into = c("wet_lab","seq_lab","bioinf"),
sep = c(1,2,3),
remove = FALSE) %>%
## change the col contents from just letters to what they indicate
mutate(wet_lab = if_else(wet_lab == "N", "NIST", "StJude"),
seq_lab = if_else(seq_lab == "N", "NIST", "StJude"),
bioinf = if_else(bioinf == "N", "NIST", "StJude"))
changeseq_anno_df ## Samantha may want this one written out
NOTE - The lab, target_site, and/ or changeseq_run variables can be used as grouping variables to calculate within dataset summarise.
Total Read Counts per Sample
total_count_df <- changeseq_anno_df %>%
group_by(changeseq_run, lab, target_site) %>%
summarise(total_read_count = sum(Nuclease_Read_Count))
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)
Total on-target read count
target_count_df <- changeseq_anno_df %>%
filter(Site_Substitution_Number == 0) %>%
rename(on_target_count = Nuclease_Read_Count) %>%
select(changeseq_run, lab, target_site, on_target_count)
Percent on-target read count (on-target read count/total read count)
on_off_target_df <- left_join(total_count_df, target_count_df) %>%
mutate(on_target_rate = on_target_count/total_read_count)
## Joining, by = c("changeseq_run", "lab", "target_site")
Total off-target read count Percent off-target read count (off-target read count/total read count)
summary_df <- changeseq_anno_df %>%
## replace NAs in Site_Substitution_Number col
mutate(Site_Substitution_Number = if_else(is.na(Site_Substitution_Number), -1, Site_Substitution_Number),
## assign read count to on-target col of on-target row per sample
on_target = if_else(Site_Substitution_Number == 0, Nuclease_Read_Count, 0),
## assign off-target read counts in non-on-target rows
off_target = if_else(on_target == 0, Nuclease_Read_Count, 0),
## seqs with subs only = seq in Site_Sequence col only
Sub_Only = if_else(is.na(Site_Sequence), FALSE, TRUE),
## seqs with indels only = seq in Site_Sequence_Gaps_Allowed col only
Indel_Only = if_else(is.na(Site_Sequence_Gaps_Allowed), TRUE, FALSE),
## seqs with two possible reps = seq in both cols
Sub_or_Indel = if_else(is.na(Sub_Only || Indel_Only), FALSE, TRUE))
Number of sequences with substitutions only Number sequences with indels only Number of sequences with two possible representations (substitutions or indels)
## create df to plot off-target ratio
off_target_ratio <- summary_df %>%
group_by(changeseq_run, lab, target_site) %>%
summarise(total_count = sum(Nuclease_Read_Count),
total_on_target = sum(on_target),
total_off_target = sum(off_target),
off_target_ratio = total_off_target/total_count)
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)
## create df to plot alignment of off-target seq resolution details
off_target_seq_align_df <- summary_df%>%
group_by(changeseq_run, lab, target_site) %>%
summarise(total_sub_only_seqs = sum(Sub_Only),
total_indels_only = sum(Indel_Only),
total_sub_or_indel = sum(Sub_or_Indel))
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)
NOTE CONSIDERING SEQUENCES THAT HAVE SUBSTITUTIONS ONLY… On-target to top 25% reads (on-target read count/top 25% off-target read count)
## create df of all off-target rows sorted by decreasing read count value
off_target_df <- summary_df %>%
mutate(Substitution_Only = if_else(is.na(Site_Sequence), FALSE, TRUE),
on_target = if_else(Site_Substitution_Number == 0, Nuclease_Read_Count, 0)) %>%
filter(Substitution_Only == TRUE) %>%
arrange(-Nuclease_Read_Count)
## take the top 25% of the off-target df - based on read count value
top_quarter_off_target_df <- off_target_df %>%
## Use top_frac to get the top 25% - use group_by to capture within run or site analysis
top_frac(0.25, wt = Nuclease_Read_Count) %>%
group_by(changeseq_run, lab, target_site) %>%
summarise(top_quarter_offtarget_readcount = sum(Nuclease_Read_Count))
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)
On-target to most prevalent off-target read (on-target read count/off-target read with largest read count)
## create df to plot ratio between on-target read count and most prevalent off-target read count per sample
on_top_off_df <- summary_df %>%
mutate(Site_Substitution_Number = if_else(is.na(Site_Substitution_Number), -1, Site_Substitution_Number),
on_target = if_else(Site_Substitution_Number == 0, Nuclease_Read_Count, 0),
off_target = if_else(on_target == 0, Nuclease_Read_Count, 0)) %>%
group_by(changeseq_run, lab, target_site) %>%
summarise(on_target_count = max(on_target),
off_target_max = max(off_target)) %>%
mutate(on_off_max_ratio = on_target_count/ off_target_max)
## `summarise()` regrouping output by 'changeseq_run', 'lab' (override with `.groups` argument)
Read count for seqs w/ 1 substitution from target sequence Read count for seqs w/ 2 substitutions from target sequence Read count for seqs w/ 3 substitutions from target sequence Read count for seqs w/ 4 substitutions from target sequence Read count for seqs w/ 5 substitutions from target sequence Read count for seqs w/ 6 substitutions from target sequence
## create df to plot the binned read counts of off-target substitution numbers
substitution_readcount_df <- summary_df %>%
mutate(Substitution_Only = if_else(is.na(Site_Sequence), FALSE, TRUE)) %>%
filter(Substitution_Only == TRUE,
Site_Substitution_Number > 0) %>%
group_by(changeseq_run, lab, target_site, Site_Substitution_Number) %>%
summarise(substitution_read_count = sum(Nuclease_Read_Count))
## `summarise()` regrouping output by 'changeseq_run', 'lab', 'target_site' (override with `.groups` argument)
## Making substitution count data frame wide for join
sub_counts_wide_df <- substitution_readcount_df %>%
mutate(Site_Substitution_Number = paste("sub",Site_Substitution_Number, "count",
sep = "_")) %>%
pivot_wider(names_from = Site_Substitution_Number,
values_from = substitution_read_count, values_fill = 0)
## Combining summary metric data frames
summary_table <- left_join(off_target_ratio, off_target_seq_align_df) %>%
left_join(top_quarter_off_target_df) %>%
left_join(on_top_off_df) %>%
left_join(sub_counts_wide_df)
## Joining, by = c("changeseq_run", "lab", "target_site")
## Joining, by = c("changeseq_run", "lab", "target_site")
## Joining, by = c("changeseq_run", "lab", "target_site")
## Joining, by = c("changeseq_run", "lab", "target_site")
## write out summary table
## Note - do not need to use paste with here, here converts the vector of character strings to a file path
write_tsv(summary_table, path = here("data", "summary_stats_table.tsv"),
na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
## Potentially use the DT or other packages for visually pleasing table presentation
summary_table %>% DT::datatable()
Summary Stats Table Column Headers
[1] “changeseq_run”: lab identifiers, sample letter ID, target site
[2] “lab”: 3-letter identifiers for wet lab, sequencing lab, and bioinformatics lab; NIST, St. Jude
[3] “target_site”: target site
[4] “total_count”: total read count for that sample (both on and off-target)
[5] “total_on_target”: total read count for the on-target sequence for that sample
[6] “total_off_target”: total read count for the off-target sequences for that sample
[7] “off_target_ratio”: total on-target read count/total off-target read count
[8] “total_sub_only_seqs”: total read count for the off-target sequences that are resolved in alignment with nucleotide substitutions only
[9] “total_indels_only”: total read count for the off-target sequences that are resolved in alignment with gaps (indels) only
[10] “total_sub_or_indel”: total read count for the off-target sequences that are resolved in alignment with substitutions or gaps
[11] “top_quarter_offtarget_readcount”: the total read count for the top quarter of off-target reads (sorted by decreasing read count)
[12] “off_target_max”: the read count for the most prevalent off-target sequence (off-target sequence with largest read count)
[13] “on_off_max_ratio”: on-target read count/read count for most prevalent off-target sequence
[14] sub_X_count: the unique substitution number found across the samples with the total number of occurrences of that read number in the off-target coordinates across each sample
For each target site, each lab’s total read count is plotted. This gives an idea how much data is generated as a whole for each target site. For each target site, each lab’s total on-target read count is plotted.
ggplot(total_count_df) +
geom_bar(aes(x = target_site, y = total_read_count, fill = lab),
position = "dodge", stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Editing Target Site", y = "Total Read Count",
fill = "Lab")
ggplot(target_count_df) +
geom_bar(aes(x = target_site, y = on_target_count, fill = lab),
position = "dodge", stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Editing Target Site", y = "Total On-Target Read Count",
fill = "Lab")
Sample F for NSS has a drastically lower read count. This was found to be a copy/paste error in their manifest file. They were pointing to sample E’s read fastq files.
For each target site, each lab’s on target rate (total on-target read count / total read count) is plotted.
ggplot(on_off_target_df) +
geom_point(aes(x = target_site, y = on_target_rate, fill = lab),
shape = 21, alpha = 0.5) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Editing Target Site", y = "On Target Rate",
fill = "Lab")
On-target rates for sample A, E, and F are nearly matched and variance for samples B, C, and D is observed. NIST-sequenced samples are slightly higher.
For each target site, each lab’s off-target rate (off-target read count / total read count) is plotted.
ggplot(off_target_ratio) +
geom_point(aes(x = target_site, y = off_target_ratio, fill = lab),
shape = 21, alpha = 0.5) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Editing Target Site", y = "Off Target Rate",
fill = "Lab")
The rates of the off-target sequences are generally all higher than that of the on-target rates. Like the on-target rates, samples A, E, and F are similar in value. Samples B, C, and D have variance with the inverse of the above - the off-target rates are higher for the St. Jude-sequenced samples than that of the NIST-sequenced samples.
For each target site, the number of off-target sequences that are resolved by substitutions only, indels (gaps) only, or substitutions or gaps are plotted for each lab.
off_target_seq_align_df %>%
## First make the table long
gather(key = "metric", value = "value", -changeseq_run, -lab, -target_site)
off_target_seq_align_df %>%
## First make the table long
gather(key = "metric", value = "value", -changeseq_run, -lab, -target_site) %>%
ggplot() +
geom_bar(aes(x = metric,
y = value,
fill = lab),
position = "dodge",
stat = "identity"
) +
## Split plot by summary metric
facet_wrap(~target_site, scales = "free_y") +
theme_bw() +
theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
labs(x = "Mismatch Type", y = "Number of Sequences", fill = "Lab")
The off-target sequences are aligned to the target sequence. These are binned in the following manner: in order to align properly to the target sequence, the off-target sequence could be resolve with substitutions (and assigned a substitution number), gaps (insertions or deletions) or by both possibilities. Here, I have binned them accordingly. As before, no differences are observed between the samples sequenced at St. Jude but run through the CHANGE-Seq bioinformatics pipeline by me and St. Jude, however, variance is observed between the samples sequenced at NIST vs. St. Jude. For samples A and B, NIST-sequenced off-target reads are overall in lower numbers as compared to St. Jude.
For each target site, weighted by read count, the top 25% off-target sequence read counts for each lab are plotted.
ggplot(top_quarter_off_target_df) +
geom_bar(aes(x = target_site, y = top_quarter_offtarget_readcount, fill = lab),
position = "dodge", stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Total Read Count of Top 25% Off-Targets", y = "Editing Target Site",
fill = "Lab")
The on-target read count divided by the most prevalent off-target sequence (the off-target sequence with the largest read count) for each sample is plotted for each lab.
ggplot(on_top_off_df) +
geom_point(aes(x = target_site, y = on_off_max_ratio, fill = lab),
shape = 21, alpha = 0.5) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Editing Target Site", y = "On-Target Rate to Most Prevalent Off-Target",
fill = "Lab")
As seen with the on-target rate considering total read counts, NIST has higher rates than St. Jude-sequenced samples.
For each substitution number (or for indels only or subs/indels which are grouped separately), the number of off-target sequences that are classified as such as such are plotted for each target sample and lab.
NOTE I currently have these plotting read counts but we may want to either replace this with or add another graph that shows occurrences for each sub number.
indel_df <- off_target_seq_align_df %>%
select(-total_sub_only_seqs) %>%
pivot_longer(names_to = "diff_cat", values_to = "read_count",
cols = c("total_indels_only","total_sub_or_indel"))
diff_cat_read_count_df <- substitution_readcount_df %>%
mutate(diff_cat = str_c("Substitution #", Site_Substitution_Number, sep = "")) %>%
select(-Site_Substitution_Number) %>%
rename(read_count = substitution_read_count) %>%
bind_rows(indel_df)
ggplot(diff_cat_read_count_df) +
geom_bar(aes(x = diff_cat,
y = read_count, fill = lab),
position = "dodge", stat = "identity") +
facet_wrap(~target_site) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Difference Type", y = "Read Count",
fill = "Lab")
As opposed to the typical venn diagrames, these show relative abundance and give appropriate scaling to visualize values.
NOTE NATE - can we add the values on the graph??
off_target_scatter_df <- off_target_df %>%
select(Chromosome, Start, End, Nuclease_Read_Count, target_site, lab)
off_target_lab_set <- off_target_df %>%
select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
group_by(`Genomic Coordinate`, target_site) %>%
summarise(lab_set = str_c(sort(lab), collapse = "-"))
## `summarise()` regrouping output by 'Genomic Coordinate' (override with `.groups` argument)
venn_df <- off_target_df %>%
select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = NA) %>%
left_join(off_target_lab_set)
## Joining, by = c("Genomic Coordinate", "target_site")
ggplot(data = venn_df) +
geom_bar(aes(x = target_site, fill = lab_set), position = "fill") +
coord_flip() +
theme_bw() +
labs(x = "Target Site",
y = "Proportion of Coordinates",
fill = "Lab Combinations")
To mimic the plots of Figure1, i in the CHANGE-Seq manuscript, common genomic coordinates of off-targets for each sample replicates are plotted in log scale.
## NSN vs. NSS: Concordant Off-Target Coordinates
nsn_v_nss <- off_target_scatter_df %>%
pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = 0) %>%
filter(NSS != 0, NSN != 0) %>%
ggplot(aes(x = NSN, y = NSS)) +
geom_point(alpha = 0.25) +
ggpubr::stat_cor(aes(label = ..rr.label..), method = "pearson") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~target_site, ncol = 1) +
theme_bw()
## NSN vs. NNN: Concordant Off-Target Coordinates
nsn_v_nnn <- off_target_scatter_df %>%
pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = 0) %>%
filter(NSN != 0, NNN != 0) %>%
ggplot(aes(x = NSN, y = NNN)) +
geom_point(alpha = 0.25) +
ggpubr::stat_cor(aes(label = ..rr.label..), method = "pearson") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~target_site, ncol = 1) +
theme_bw()
## NSS vs. NNN: Concordant Off-Target Coordinates
nss_v_nnn <- off_target_scatter_df %>%
pivot_wider(names_from = "lab", values_from = "Nuclease_Read_Count", values_fill = 0) %>%
filter(NSS != 0, NNN != 0) %>%
ggplot(aes(x = NSS, y = NNN)) +
geom_point(alpha = 0.25) +
ggpubr::stat_cor(aes(label = ..rr.label..), method = "pearson") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~target_site, ncol = 1) +
theme_bw()
ggpubr::ggarrange(nsn_v_nss, nsn_v_nnn, nss_v_nnn, ncol = 3)
The read count values for the off-target coordinates unique to replicate 1, replicate 2, and the intersection of the replicates was extracted. Unique read count values from all 3 data were then used to tally each occurrence across the unique genomic coordinates for replicate 1, replicate 2. and the intersection of the replicates
Our traditional way of viewing
nnn_nsn_lab_set <- off_target_df %>%
select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
group_by(`Genomic Coordinate`, target_site) %>%
filter(lab != "NSS") %>%
summarise(lab_set = str_c(sort(lab), collapse = "-"))
## `summarise()` regrouping output by 'Genomic Coordinate' (override with `.groups` argument)
rep_count_scatter_df <- off_target_df %>%
select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
filter(lab != "NSS") %>%
right_join(nnn_nsn_lab_set) %>%
count(target_site, lab_set, Nuclease_Read_Count, name = "Occurence")
## Joining, by = c("Genomic Coordinate", "target_site")
ggplot(rep_count_scatter_df) +
geom_point(aes(x = Nuclease_Read_Count, y = Occurence, fill = lab_set), shape = 21) +
scale_x_log10() +
facet_wrap(~target_site, scales = "free_y", ncol = 1) +
theme_bw()
Density view
nnn_nsn_lab_set <- off_target_df %>%
select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
group_by(`Genomic Coordinate`, target_site) %>%
filter(lab != "NSS") %>%
summarise(lab_set = str_c(sort(lab), collapse = "-"))
## `summarise()` regrouping output by 'Genomic Coordinate' (override with `.groups` argument)
rep_count_scatter_df <- off_target_df %>%
select(`Genomic Coordinate`, target_site, lab, Nuclease_Read_Count) %>%
filter(lab != "NSS") %>%
right_join(nnn_nsn_lab_set)
## Joining, by = c("Genomic Coordinate", "target_site")
ggplot(rep_count_scatter_df) +
geom_histogram(aes(x = Nuclease_Read_Count, fill = lab_set)) +
scale_x_log10() +
facet_wrap(~target_site, scales = "free_y", ncol = 1) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is plotting the proportion of sites (occurrence / total read count) as opposed to just occurrence number as an attempt to normalize
rep_count_scatter_df %>%
mutate(lab_set = factor(lab_set, level = c("NNN-NSN", "NNN","NSN"))) %>%
ggplot() +
geom_histogram(aes(x = Nuclease_Read_Count, fill = lab_set), position = "fill") +
scale_x_log10() +
facet_wrap(~target_site, scales = "free_y", ncol = 1) +
theme_bw() +
scale_fill_brewer(type = "qual", palette = 2) +
labs(x = "Read Count", y = "Proportion of Sites")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 331 rows containing missing values (geom_bar).
sessioninfo::platform_info()
## setting value
## version R version 4.0.1 (2020-06-06)
## os macOS Catalina 10.15.5
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2020-08-21
sessioninfo::package_info() %>%
filter(attached = TRUE) %>%
select(package, loadedversion, date, source) %>%
knitr::kable(booktabs = TRUE, row.names = FALSE)
| package | loadedversion | date | source |
|---|---|---|---|
| abind | 1.4-5 | 2016-07-21 | CRAN (R 4.0.2) |
| assertthat | 0.2.1 | 2019-03-21 | CRAN (R 4.0.0) |
| backports | 1.1.8 | 2020-06-17 | CRAN (R 4.0.0) |
| blob | 1.2.1 | 2020-01-20 | CRAN (R 4.0.0) |
| bookdown | 0.20 | 2020-06-23 | CRAN (R 4.0.2) |
| broom | 0.5.6 | 2020-04-20 | CRAN (R 4.0.0) |
| car | 3.0-9 | 2020-08-11 | CRAN (R 4.0.1) |
| carData | 3.0-4 | 2020-05-22 | CRAN (R 4.0.2) |
| cellranger | 1.1.0 | 2016-07-27 | CRAN (R 4.0.0) |
| cli | 2.0.2 | 2020-02-28 | CRAN (R 4.0.0) |
| colorspace | 1.4-1 | 2019-03-18 | CRAN (R 4.0.0) |
| cowplot | 1.0.0 | 2019-07-11 | CRAN (R 4.0.2) |
| crayon | 1.3.4 | 2017-09-16 | CRAN (R 4.0.0) |
| crosstalk | 1.1.0.1 | 2020-03-13 | CRAN (R 4.0.2) |
| curl | 4.3 | 2019-12-02 | CRAN (R 4.0.0) |
| data.table | 1.12.8 | 2019-12-09 | CRAN (R 4.0.0) |
| DBI | 1.1.0 | 2019-12-15 | CRAN (R 4.0.0) |
| dbplyr | 1.4.4 | 2020-05-27 | CRAN (R 4.0.0) |
| digest | 0.6.25 | 2020-02-23 | CRAN (R 4.0.0) |
| dplyr | 1.0.0 | 2020-05-29 | CRAN (R 4.0.0) |
| DT | 0.14 | 2020-06-24 | CRAN (R 4.0.2) |
| ellipsis | 0.3.1 | 2020-05-15 | CRAN (R 4.0.0) |
| evaluate | 0.14 | 2019-05-28 | CRAN (R 4.0.0) |
| fansi | 0.4.1 | 2020-01-08 | CRAN (R 4.0.0) |
| farver | 2.0.3 | 2020-01-16 | CRAN (R 4.0.0) |
| forcats | 0.5.0 | 2020-03-01 | CRAN (R 4.0.0) |
| foreign | 0.8-80 | 2020-05-24 | CRAN (R 4.0.1) |
| fs | 1.4.2 | 2020-06-30 | CRAN (R 4.0.1) |
| generics | 0.0.2 | 2018-11-29 | CRAN (R 4.0.0) |
| ggplot2 | 3.3.2 | 2020-06-19 | CRAN (R 4.0.0) |
| ggpubr | 0.4.0 | 2020-06-27 | CRAN (R 4.0.2) |
| ggsignif | 0.6.0 | 2019-08-08 | CRAN (R 4.0.2) |
| glue | 1.4.1 | 2020-05-13 | CRAN (R 4.0.0) |
| gtable | 0.3.0 | 2019-03-25 | CRAN (R 4.0.0) |
| haven | 2.3.1 | 2020-06-01 | CRAN (R 4.0.0) |
| here | 0.1 | 2017-05-28 | CRAN (R 4.0.0) |
| hms | 0.5.3 | 2020-01-08 | CRAN (R 4.0.0) |
| htmltools | 0.5.0 | 2020-06-16 | CRAN (R 4.0.0) |
| htmlwidgets | 1.5.1 | 2019-10-08 | CRAN (R 4.0.0) |
| httr | 1.4.1 | 2019-08-05 | CRAN (R 4.0.0) |
| jsonlite | 1.7.0 | 2020-06-25 | CRAN (R 4.0.0) |
| knitr | 1.29 | 2020-06-23 | CRAN (R 4.0.1) |
| labeling | 0.3 | 2014-08-23 | CRAN (R 4.0.0) |
| lattice | 0.20-41 | 2020-04-02 | CRAN (R 4.0.1) |
| lifecycle | 0.2.0 | 2020-03-06 | CRAN (R 4.0.0) |
| lubridate | 1.7.9 | 2020-06-08 | CRAN (R 4.0.0) |
| magrittr | 1.5 | 2014-11-22 | CRAN (R 4.0.0) |
| modelr | 0.1.8 | 2020-05-19 | CRAN (R 4.0.0) |
| munsell | 0.5.0 | 2018-06-12 | CRAN (R 4.0.0) |
| nlme | 3.1-148 | 2020-05-24 | CRAN (R 4.0.1) |
| openxlsx | 4.1.5 | 2020-05-06 | CRAN (R 4.0.2) |
| pillar | 1.4.4 | 2020-05-05 | CRAN (R 4.0.0) |
| pkgconfig | 2.0.3 | 2019-09-22 | CRAN (R 4.0.0) |
| purrr | 0.3.4 | 2020-04-17 | CRAN (R 4.0.0) |
| R6 | 2.4.1 | 2019-11-12 | CRAN (R 4.0.0) |
| RColorBrewer | 1.1-2 | 2014-12-07 | CRAN (R 4.0.0) |
| Rcpp | 1.0.5 | 2020-07-06 | CRAN (R 4.0.1) |
| readr | 1.3.1 | 2018-12-21 | CRAN (R 4.0.0) |
| readxl | 1.3.1 | 2019-03-13 | CRAN (R 4.0.0) |
| reprex | 0.3.0 | 2019-05-16 | CRAN (R 4.0.0) |
| rio | 0.5.16 | 2018-11-26 | CRAN (R 4.0.2) |
| rlang | 0.4.6 | 2020-05-02 | CRAN (R 4.0.0) |
| rmarkdown | 2.3 | 2020-06-18 | CRAN (R 4.0.0) |
| rprojroot | 1.3-2 | 2018-01-03 | CRAN (R 4.0.0) |
| rstatix | 0.6.0 | 2020-06-18 | CRAN (R 4.0.2) |
| rstudioapi | 0.11 | 2020-02-07 | CRAN (R 4.0.0) |
| rvest | 0.3.5 | 2019-11-08 | CRAN (R 4.0.0) |
| scales | 1.1.1 | 2020-05-11 | CRAN (R 4.0.0) |
| sessioninfo | 1.1.1 | 2018-11-05 | CRAN (R 4.0.2) |
| stringi | 1.4.6 | 2020-02-17 | CRAN (R 4.0.0) |
| stringr | 1.4.0 | 2019-02-10 | CRAN (R 4.0.0) |
| tibble | 3.0.2 | 2020-07-07 | CRAN (R 4.0.1) |
| tidyr | 1.1.0 | 2020-05-20 | CRAN (R 4.0.0) |
| tidyselect | 1.1.0 | 2020-05-11 | CRAN (R 4.0.0) |
| tidyverse | 1.3.0 | 2019-11-21 | CRAN (R 4.0.0) |
| vctrs | 0.3.1 | 2020-06-05 | CRAN (R 4.0.0) |
| withr | 2.2.0 | 2020-04-20 | CRAN (R 4.0.0) |
| xfun | 0.15 | 2020-06-21 | CRAN (R 4.0.1) |
| xml2 | 1.3.2 | 2020-04-23 | CRAN (R 4.0.0) |
| yaml | 2.2.1 | 2020-02-01 | CRAN (R 4.0.0) |
| zip | 2.1.0 | 2020-08-10 | CRAN (R 4.0.1) |